Introduction 


Transliteration is the transformation of text from one script to another, usually based on phonetic 
equivalencies (IBM, 1999). The popularity and simplicity of the Roman script is a major motivation 
for transliterating other scripts into it, and both people and software benefit from such 
transliteration aids. Transliteration from native scripts to Roman script has been achieved for 
many Asian languages, including Arabic, Bengali, Persian, Hindi, Punjabi and Urdu. 
Transliteration between Punjabi scripts (Malik, 2006b) and Hindi to Urdu transliteration (Malik, 
et al., 2008) are key examples of South Asian language transliteration. Applications such as 
Google Transliteration IME help users of different languages to transliterate between their 
native scripts and Roman script. It is currently available for 19 different 
languages including Arabic, Bengali, Farsi (Persian), Gujarati, Hindi, Punjabi, Sanskrit and Urdu. 
Transliteration between the two scripts of Sindhi has not yet been achieved, or even attempted. 
Sindhi computational linguistics has received little encouragement in either Pakistan or India 
(Khubchandani, 1970), and existing work is limited to font design and word processing. Sindhi 
computing should not be confined to font design and word processing; extensive research is 
needed in the areas of artificial intelligence, computational linguistics, natural language 
processing, corpus linguistics and script processing, including transliteration (Rahman, 2009). 

The following sections present a brief history of the Sindhi language, the Perso-Arabic and 
Devanagari scripts of Sindhi, the composition of both scripts, a suggested Roman script for 
Sindhi and an algorithm for transliteration between the two scripts of Sindhi. 


Sindhi language 


Sindhi is an Indo-Aryan language with roots in ancient history. It is spoken by 
approximately 40 million people (Sindhi Language Authority, 2009) in the Sindh province of 
Pakistan as well as in several states of India. In Pakistan Sindhi is written in the Perso-Arabic 
script, while in India it is written in both the Devanagari and Perso-Arabic scripts. Both regions 
largely share the same vocabulary. 


Sindhi scripts 
Sindhi scripts and their writing systems are briefly described below. 


Perso-Arabic script 


The Perso-Arabic script of Sindhi consists of 52 letters: most are taken from the Arabic 
alphabet, some from Persian, and a few are modified letters. In the Perso-Arabic script each 
letter has one to four forms depending on its position: initial, medial, final and standalone. 
Letters in the Perso-Arabic script are divided into types on the basis of phonemes. These 
types are discussed below with reference to their phonemes, writing style and position in 
a word. 

The first type is aspirated consonants. In the Perso-Arabic script some aspirated sounds are 
written by combining two letters. For example, the aspirated form of گ (g) is written by combining 
گ (g) with ھ (h) as گھ (gh); some other aspirated consonants are formed similarly. In Roman 
script, “h” is combined with the letter of the nearest sound. For example, “g” + “h” = “gh” is 
used to represent گھ (gh). On the other hand, some aspirated consonants in Sindhi are represented 
by a single letter, such as ڇ (chh) and ٿ (th). 

The non-aspirated consonants of the Perso-Arabic script are transliterated according to their 
phonemes. In some cases several non-aspirated consonants share a single phoneme, and all of 
them have a single counterpart in the Devanagari script. These are discussed further in section 3. 

There are three main vowel letters in the Perso-Arabic script: ا, و and ي. At the beginning 
of a word these are simply treated as non-aspirated consonants, and at the end of 
a word they are treated as vowels. In the middle of a word they must be handled in the context 
of the neighbouring letters, depending on whether those are vowels or consonants. 

Diacritical marks are essential for correct pronunciation; when they are missing, the result is 
not only ambiguity in transliteration but also misinterpretation of words. Diacritical marks are 
therefore as important for avoiding ambiguity in transliteration as they are for natural language 
processing and speech synthesis (Malik, et al., 2009). Examples of diacritical marks 
include Zabar, Zer and Pesh. Pairs of words that differ in meaning only because of 
their diacritical marks are shown in table 1. 


Table 1. Effects of diacritical marks on meaning. 


Word     Meaning      Word     Meaning 
اَٺ      Eight        اُٺ      Camel 
چَپ      Lip          چُپ      Silent 
کِل      Laughter     کَل      Skin 
مَلڻ     To rub       مِلڻ     To meet 
ڪَرڻ     To do        ڪِرڻ     To fall 
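

The ambiguity caused by missing diacritical marks can be sketched in a few lines of Python. This 
is only an illustration: the function name strip_diacritics is hypothetical, and the two words 
are assumed to be the pair for "eight" and "camel", written with Zabar (U+064E) and Pesh 
(U+064F) on the first letter. 

```python
import unicodedata


def strip_diacritics(word: str) -> str:
    """Drop combining marks (Unicode category Mn), such as Zabar, Zer and Pesh."""
    return "".join(ch for ch in word if unicodedata.category(ch) != "Mn")


# Assumed spellings, used only for illustration.
eight = "\u0627\u064E\u067A"  # alef + Zabar + tteheh
camel = "\u0627\u064F\u067A"  # alef + Pesh + tteheh

print(eight == camel)                                      # False
print(strip_diacritics(eight) == strip_diacritics(camel))  # True
```

Once the marks are removed the two words are identical strings, so a transliterator that sees 
only the undiacritized form has no way to tell them apart without further context. 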





In Sindhi there are four implosive stops. In the Perso-Arabic script they are represented by 
adding extra dot(s) to the letter of the nearest matching sound. For example, the implosive 
version of گ (g) is ڳ (gg); the three other implosive stops are formed similarly. In Roman script 
these letters are represented by doubling the letter of the nearest sound: “bb”, “jj”, “dd” and 
“gg”, a convention that is commonly used in Sindhi-Roman writing. 


Devanagari script 


The Devanagari script is adopted from the Sanskrit system of writing, in which each character 
represents a syllable. It is written from left to right. Many letters that have the same phoneme 
in the Perso-Arabic script correspond to a single letter in the Devanagari script. It is also 
worth mentioning that two letters of the Perso-Arabic script have no equivalent in the Devanagari 
script: ء (‘a) and ع (A). 

The Devanagari script also has aspirated and non-aspirated consonants but, unlike the 
Perso-Arabic script, it does not use composite letters for aspirated consonants. Hence all 
aspirated consonants are denoted by a single letter. The Devanagari script has two types of 
vowels: independent and dependent. The independent form of a vowel is used at the beginning of a 
word and the dependent form at the end of a word. In the middle of a word the dependent 
form is usually used, but there are some exceptions. 

Unlike in the Perso-Arabic script, diacritical marks are not optional in the Devanagari script, 
as shown in example 1. 


Example 1 

alent sare. 

tooN suhiNree AaheeN. 
You beautiful are 

oal gigte Os 

You are beautiful. 
As shown in table 2, diacritical marks are widely used and are an integral part of words 
written in the Devanagari script. Thus transliteration accuracy is assured when going from 
Devanagari to the Perso-Arabic script. 


Table 2. Diacritical marks in Devanagari 














Devanagari Roman Perso-Arabic 
HGS = tooN us 
ater WHEN = gfo suhiNree sighs 
aT+e+ot+e) = ard AaheeN ual 





Table 3. Simple consonants and independent vowels in Devanagari, Roman and Perso-Arabic 
scripts. 


Devanagari    Roman    Perso-Arabic 
आ             Aa       آ 
अ             a        ا 
ब             b        ب 
भ             bh       ڀ 
थ             th       ٿ 
ट             T        ٽ 
ठ             Thh      ٺ 
प             p        پ 
ज             j        ج 
झ             jh       جھ 
ञ             nn       ڃ 
च             ch       چ 
छ             chh      ڇ 
ख़             khh      خ 
द             d        د 
ध             dh       ڌ 
ड             D        ڊ 
ढ             Dh       ڍ 
र             r        ر 
ड़             R        ڙ 
श             sh       ش 
ग़             G        غ 
फ़             f        ف 
फ             ph       ڦ 
क़             q        ق 
क             k        ڪ 
ख             kh       ک 
ग             g        گ 
घ             gh       گھ 
ङ             Ng       ڱ 
ल             l        ل 
म             m        م 
न             n        ن 
gy N O 
ण             Nr       ڻ 
व             v        و 
T G 





Perso-Arabic, Devanagari and Roman scripts 


The Roman script is based on the alphabet developed by the ancient Romans and is used by most of 
the languages of Europe, including English, French and German (SIL International, 2003). To 
achieve Sindhi transliteration, an intermediate Roman script is used. Because writing Sindhi and 
other languages in Roman script is very common nowadays, this model takes every possible step 
to preserve the most common Roman style of writing. Users should therefore have little difficulty 
in adopting this method and can transliterate in any direction among the Perso-Arabic, 
Roman and Devanagari scripts of Sindhi. Most consonants are transliterated to their matching 
sounds in Roman script. 

The mapping between the Perso-Arabic, Devanagari and equivalent Roman scripts is shown in 
table 3. Table 3 contains all consonants except the implosive stops and those that share a 
phoneme with another letter. The four unique implosive stops of Sindhi are shown in table 4. 


Table 4. Implosive stops 


Devanagari    Roman    Perso-Arabic 
ॿ             bb       ٻ 
ॼ             jj       ڄ 
ॾ             dd       ڏ 
ॻ             gg       ڳ 
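

Because the Roman scheme uses multi-letter symbols such as “chh”, “Dh” and “gg”, a Roman word 
has to be split into symbols before any table lookup. One possible way to do this, sketched below 
under our own assumptions, is greedy longest-match tokenization. The symbol list is only a subset 
of the Roman equivalents used in this paper, and the names ROMAN_SYMBOLS and tokenize are 
hypothetical. 

```python
# Subset of the Roman symbols used in this paper (consonants, implosives, vowels).
ROMAN_SYMBOLS = [
    "chh", "khh", "Thh",
    "bh", "th", "jh", "ch", "dh", "Dh", "sh", "ph", "kh", "gh", "Ng", "Nr", "nn",
    "bb", "jj", "dd", "gg",
    "Aa", "aa", "ee", "oo",
    "b", "T", "p", "j", "d", "D", "r", "R", "G", "f", "q", "k", "g", "h",
    "l", "m", "n", "N", "v", "a", "i", "u", "o", "e",
]

# Try longer symbols first so that "Dh" is not read as "D" followed by "h".
_BY_LENGTH = sorted(ROMAN_SYMBOLS, key=len, reverse=True)


def tokenize(word: str) -> list:
    """Split a Roman-script word into symbols using greedy longest match."""
    tokens = []
    i = 0
    while i < len(word):
        for symbol in _BY_LENGTH:
            if word.startswith(symbol, i):
                tokens.append(symbol)
                i += len(symbol)
                break
        else:
            tokens.append(word[i])  # unknown character: pass it through unchanged
            i += 1
    return tokens


print(tokenize("ddaaDhee"))  # ['dd', 'aa', 'Dh', 'ee']
```

Greedy matching cannot distinguish an aspirated “gh” from a plain “g” followed by “h”; such 
cases would need the dictionary lookup or additional context. 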





Besides the letters listed in tables 3 and 4, several Sindhi letters in the Perso-Arabic script 
share the same equivalent in the Devanagari script, as shown in table 5. This is because of 
their similar sounds, though we have suggested separate Roman equivalents for these letters. 

The letter ع (A) came into the Sindhi alphabet from Arabic. Native Sindhi speakers do not 
pronounce it fully (producing the sound from deep in the throat) in normal conversation, and 
there is no equivalent of ع (A) in the Devanagari script (Malik, 2006a). The same is true for 
ء (‘a) of the Perso-Arabic script. These two letters can be transliterated easily from 
Perso-Arabic to Roman script and vice versa. In Devanagari transliteration they are either 
ignored or transliterated into अ (a) or ए (e). Mostly these letters are ignored when 
transliterating Perso-Arabic or Roman script to Devanagari script, as shown in example 2 and 
table 6. 


Example 2. 

| sas Liles 

mAaaf kajo! 

forgive do 

aM spoil | 

Do forgive! / Forgive me! 


Table 5. Perso-Arabic multiple consonants with same Devanagari equivalents 








Devanagari Roman Perso-Arabic 
a t Š 
q Tt L 
B H Z 
z h > 
S Z ïj 
S Z J 
st ZZ ua 
S Zz L 
a S us 
a S ġa 
a c 4 





Table 6. Letters with no equivalents 








Devanagari Roman Perso-Arabic 
q m ê 
2 A E 
oT aa | 
P f (ar) 





Example 2 illustrates this: analysing the word mAaaf in table 6 shows that ع (A) has been 
completely omitted in transliteration from Roman / Perso-Arabic to Devanagari. The two letters 
without Devanagari equivalents are shown separately in table 7, with their equivalent Roman 
letters. 


Table 7. Perso-Arabic letters with no Devanagari equivalent 


Devanagari    Roman    Perso-Arabic 
-             A        ع 
-             ‘a       ء 


There are two special words that are written in a special form, using a single letter with 
two elongated quotation marks beneath it. These are shown in table 8 in all three scripts. 
Note that the Roman representation of these words is capitalized to avoid ambiguity with other 
letters in the word or sentence. 


Table 8. Special single letter words of Sindhi 


Devanagari    Roman    Perso-Arabic 
ऐं            AEN      ۽ 
में           MEN      ۾ 











The dependent vowels and diacritical marks are shown in table 9 in all three scripts. To make 
them easier to follow, they are shown in combination with the letter ज (j) in Devanagari, “j” 
in Roman script and ج (j) in Perso-Arabic script. 


Table 9. Dependent vowels and diacritical marks 








Devanagari Roman Perso-Arabic 
q ja d 
a= ji eo 
IH ju č 
aHA=À jo = 
T+ =H joo = 
+À je i> 
w+ ot att jee EENS 
+03 jaa l 





Sample conversions and problems 


A dictionary lookup is used to transliterate words in which one or more letters follow 
none of the rules. Conversion from one script of Sindhi to any other script can be achieved 
by implementing the rules given below: 


i. Take a whole word as input. 
ii. Look the word up in the dictionary of special words. 
iii. If it is not in the dictionary, start transliterating. 
iv. Transliterate the first letter as a consonant or independent vowel. 
v. From the second to the second-last letter, if a letter is a consonant, or a vowel with a 
diacritical mark, transliterate it as a consonant. 
vi. If the letter is a vowel and has no diacritical mark, transliterate it as a dependent vowel. 
vii. If the last letter is a vowel, transliterate it as a dependent vowel. 
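

The rules above can be sketched as follows. This is a minimal illustration rather than a 
complete implementation: the Letter class, the role names and the parse and render callbacks 
are all hypothetical, and treating a final consonant as a plain consonant is our assumption, 
since the rules do not state that case explicitly. 

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Letter:
    symbol: str                   # the letter in the source script
    is_vowel: bool = False        # True for a vowel letter
    has_diacritic: bool = False   # True if a diacritical mark is attached


def classify_roles(letters: List[Letter]) -> List[str]:
    """Assign a role to every letter of one word, following rules iv to vii."""
    roles = []
    last = len(letters) - 1
    for i, letter in enumerate(letters):
        if i == 0:                                          # rule iv
            role = "independent_vowel" if letter.is_vowel else "consonant"
        elif i == last and letter.is_vowel:                 # rule vii
            role = "dependent_vowel"
        elif letter.is_vowel and not letter.has_diacritic:  # rule vi
            role = "dependent_vowel"
        else:                                               # rule v (and final consonants)
            role = "consonant"
        roles.append(role)
    return roles


def transliterate_word(
    word: str,
    special_words: Dict[str, str],
    parse: Callable[[str], List[Letter]],
    render: Callable[[List[Letter], List[str]], str],
) -> str:
    """Rules i to iii: dictionary lookup first, rule-based transliteration otherwise."""
    if word in special_words:
        return special_words[word]
    letters = parse(word)
    return render(letters, classify_roles(letters))


demo = [
    Letter("b"),
    Letter("w", is_vowel=True, has_diacritic=True),
    Letter("a", is_vowel=True),
    Letter("r"),
]
print(classify_roles(demo))
# ['consonant', 'consonant', 'dependent_vowel', 'consonant']
```

The parse step (splitting a word into letters and detecting diacritical marks) and the render 
step (looking each letter up in the mapping tables according to its role) depend on the source 
and destination scripts and are left as callbacks here. 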


[Figure 1: flowchart with the following boxes: Input source Sindhi script; Dictionary lookup; 
If source is Roman; Output Roman script; Transliteration algorithm, source to Roman; 
Transliteration, Roman to destination; Output destination.] 


Figure 1. Transliteration model for Sindhi scripts. 
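

One plausible reading of the model in figure 1 is a pipeline that uses Roman script as a pivot: 
the source is first converted to Roman, and the Roman form is then converted to the destination 
script. The sketch below assumes that reading. The tables are toy examples containing only the 
implosive “bb”, and the names TO_ROMAN, FROM_ROMAN, SPECIAL_WORDS and transliterate are 
hypothetical. 

```python
# Toy single-letter tables, for illustration only.
TO_ROMAN = {
    "perso_arabic": {"\u067B": "bb"},  # Perso-Arabic implosive b
    "devanagari": {"\u097F": "bb"},    # Devanagari implosive b
}

# Invert each table to obtain the Roman-to-native direction.
FROM_ROMAN = {
    script: {roman: native for native, roman in table.items()}
    for script, table in TO_ROMAN.items()
}

# Hypothetical dictionary of special words: (source, destination, word) -> result.
SPECIAL_WORDS = {}


def transliterate(word: str, source: str, destination: str) -> str:
    """Dictionary lookup first, then source -> Roman -> destination."""
    special = SPECIAL_WORDS.get((source, destination, word))
    if special is not None:
        return special
    # Step 1: bring the input into Roman script unless it is Roman already.
    roman = word if source == "roman" else TO_ROMAN[source].get(word, word)
    if destination == "roman":
        return roman
    # Step 2: map the Roman form into the destination script.
    return FROM_ROMAN[destination].get(roman, roman)


print(transliterate("\u067B", "perso_arabic", "roman"))  # bb
```

With a Roman pivot, each script needs only one pair of tables (to and from Roman) instead of a 
separate table for every pair of scripts. 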


These rules become more complex when transliterating from Perso-Arabic to Devanagari 
words that lack proper diacritical marks. In this situation transliteration is done by 
analysing the letters that come before and after the letter that has no diacritical mark. If 
no condition matches (for example, in the case of consecutive vowels), transliteration 
is finally decided on a probability basis. A sample transliteration is shown in example 
3. 


Example 3. 

R Stet Tt sare. 

bbaahir ddaaDhee garmee Aahe. 
outside very hot is 


al ga 8 gÀ Ah 
It is very hot outside. 


The suggested set of rules transliterates the majority of sentences correctly, as in example 3. 
However, there are some ambiguities for letters that have no proper equivalent in the Devanagari 
script. For example, the letter ء (‘a) of the Perso-Arabic script is sometimes equivalent to 
अ (a) in Devanagari, while अ (a) is actually the equivalent of ا (a) of the Perso-Arabic script. 
Similarly, ء (‘a) is sometimes equivalent to ए (e) (the equivalent of the Perso-Arabic letter for 
e), and sometimes ء (‘a) is omitted completely to achieve the correct transliteration. As example 
4 shows, the letter अ (a) of the first word is wrongly transliterated into the letter ا (a) of the 
Perso-Arabic script, while its correct transliteration would be ء (‘a). 


Example 4. 

eat Stet fant sre. 

hoo’a ddaaDhee piyaaree aahe. 
she very lovely is 


st) Sky ga! | 4 (incorrect) 
st) Sky eA é 54 (correct) 
She is very lovely. 


Conclusion and future work 


Once the transliteration model discussed here is implemented, it will be possible to 
transliterate Sindhi from one script to another. People familiar with one script will be able to 
read text written in the other. The model will also be useful for implementing simple Roman to 
Sindhi (either of the two scripts) transliteration. With a working transliteration design, 
transliteration aids such as Google Transliteration IME would be able to use the Roman to Sindhi 
mapping to transliterate between Roman and native Sindhi scripts. Automatic 
transliteration will also help to settle the debate over adopting the Roman script for the Sindhi 
language. The proposed model still needs to be evaluated at scale by applying the algorithm to a 
reasonably large corpus. 